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MULTIPLE HYPOTHESIS TESTING ADJUSTED FOR LATENT 
VARIABLES, WITH AN APPLICATION TO THE AGEMAP GENE 
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Stanford University 

In high throughput settings we inspect a great many candidate 
variables (e.g., genes) searching for associations with a primary vari- 
able (e.g., a phenotype). High throughput hypothesis testing can be 
made difficult by the presence of systemic effects and other latent 
variables. It is well known that those variables alter the level of tests 
and induce correlations between tests. They also change the rela- 
tive ordering of significance levels among hypotheses. Poor rankings 
lead to wasteful and ineffective follow-up studies. The problem be- 
comes acute for latent variables that are correlated with the primary 
variable. We propose a two-stage analysis to counter the effects of 
latent variables on the ranking of hypotheses. Our method, called 
LEAPP, statistically isolates the latent variables from the primary 
one. In simulations, it gives better ordering of hypotheses than com- 
peting methods such as SVA and EIGENSTRAT. For an illustration, 
we turn to data from the AGEMAP study relating gene expression 
to age for 16 tissues in the mouse. LEAPP generates rankings with 
greater consistency across tissues than the rankings attained by the 
other methods. 

1. Introduction. There has been considerable progress in multiple test- 
ing methods for high throughput applications. A common example, coming 
from biology, is testing which of N genes' expression levels correlate sig- 
nificantly with a scalar variable, which we'll call the primary variable. The 
primary variable may be an experimentally applied treatment or it may be 
a covariate such as a phenotype. We will use the gene expression example 
for concreteness, although it is just one of many instances of this problem. 

High throughput experiments may involve thousands or even millions of 
hypotheses. Because iV is so large, serious problems of multiplicity arise. For 
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independent tests, methods based on the false discovery rate [Dudoit and 
van der Laan (2008)] have been very successful. Attention has turned more 
recently to dependent tests [Efron (2010)]. 

One prominent cause of dependency among test statistics is the presence 
of latent variables. For example, in microarray-based experiments, it is well 
known that samples processed in the same batch are correlated. Batch, tech- 
nician and other sources of variation in sample preparation can be modeled 
by latent variables. Another example comes from genetic association stud- 
ies, where differences in ancestral history among subjects can lead to false 
or inaccurate associations. Price et al. (2006) used principal components 
to extract and correct for ancestral history, in effect modeling the genetic 
background of the subjects as latent variables. A third example comes from 
copy number data, where local trends along the genome cause false positive 
copy number calls [Olshen et al. (2004)]. Diskin et al. (2008) conducted ex- 
periments showing that these local trends correlate with the percentage of 
bases that are guanines or cytokines along the genome, and are caused by 
differences in the quantity and handling of DNA. These laboratory effects 
are hard to measure, but can be quantified using a latent variable model. In 
this paper, we consider latent variables that might even be correlated with 
the primary variable. 

When the primary variable is an experimentally applied treatment, then 
problematic latent variables are those that are partially confounded with 
the treatment. Randomization reduces the effects of such confounding, but 
randomization is not always perfectly applied and batch or other effects may 
be imbalanced with respect to the treatment [Leek et al. (2010)]. 

These latent variables have some severe consequences. They alter the level 
of the hypothesis tests and they induce correlations among multiple tests. 
Another consequence, that we find especially concerning, is that the latent 
variables may affect the rank ordering among the N p- values. When high 
throughput methods are used to identify candidates for further follow-up it 
is important that the highly ranked items contain as many nonnull cases as 
possible. 

Our approach to this problem uses a rotated model in which we separate 
the latent variables from the primary variable. We do this by creating two 
data sets, one in which both primary and latent variables are present and 
one in which the primary variables are absent. We use the latter data set 
to estimate the latent variables and then substitute their estimates into the 
former. Since each gene has its own effect size in relation to the primary 
variable, the former model is supersaturated. We conduct inference under 
the setting where the parameter vector relating the genes to the primary 
variable is sparse, as is commonly assumed in multiple testing situations. 
Each nonnull hypotheses behaves as an additive outlier, and we then apply 
an outlier detection method from She and Owen (2011) to find them. We call 
the method LEAPP, for Zatent effect adjustment after primary projection. 
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Section 2 presents our notation and introduces LEAPP along with several 
other related models, including SVA [Leek and Storey (2008)] and EIGEN- 
STRAT [Price et al. (2006)], to which we make comparisons. Section 3 
shows via simulation that LEAPP generates better rankings of the non- 
null hypotheses than one would get by either ignoring the latent variables, 
by SVA, or by EIGENSTRAT. EIGENSTRAT estimates the latent variables 
(by principal components) without first adjusting for the primary variable. 
LEAPP outperforms it when the latent variable is weaker than the primary 
EIGENSTRAT does well in simulations with weak primary variables, which 
matches the setting that motivated it. Still it is interesting to learn that 
it does not extend well to problems with strong primary variables. SVA 
estimates the primary variable's coefficients without first adjusting for cor- 
relation between the primary and latent variables. LEAPP outperforms it 
when the latent and primary variables are correlated. 

Section 4 compares the methods on the AGEMAP data of Zahn et al. 
(2007). The primary variable there is age. While we do not know the truly 
nonnull genes for this problem, we have a proxy. The data set has 16 subsets, 
each from a different tissue type. We find that LEAPP gives gene lists with 
much greater overlap among tissues than the gene lists achieved by the other 
methods. Our conclusions are in Section 5. We include some brief remarks on 
calibration of the p- values themselves as opposed to the rank ordering which 
is the primary focus of this paper. Some theory is given in the Appendix for 
a simplified version of LEAPP. The specific rotation matrix used does not 
affect our answer. For the case of one latent variable and no covariates, the 
simplified LEAPP consistently estimates the latent structure. We also get a 
bound for the sum of squared coefficient errors when the effects are sparse. 

2. Notation and models. In this section we describe the data model and 
introduce the parameters and latent variables that arise. Then we describe 
our LEAPP proposal which is based on a series of reductions from a het- 
eroscedastic multivariate regression including latent factors to a single linear 
regression problem with additive outliers and known error variance. We also 
describe EIGENSTRAT and SVA, to which we make comparisons, and then 
survey several other published methods for this problem. 

2.1. Data, parameters, latent variables and tests. The data we observe 
are a response matrix Y G ]R Arxn and a variable of interest g G W 1 , which 
we call the primary variable. In an expression problem Yij is the expression 
level of gene i for subject j. Very often the primary variable g is a group 
variable taking just two values, such as ±1 for a binary phenotype, then 
linearly transformed to have mean and norm 1. The quantity gj can also 
be a more general scalar, such as the age of subject j. 

We are interested to know which genes, if any, are linearly associated with 
the variable g. We capture this linear association through the N x n matrix 
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jg , where 7 is a vector of ./V coefficients. When most genes are not related 
to g, then 7 is sparse. 

Often there are covariates X other than g that we should adjust for. The 
covariate term is fiX T where f3 contains coefficients. The latent variables that 
cause tests to be mutually correlated are assumed to take an outer product 
form UV T . Neither U nor V is observed. Finally, there is observational noise 
with a variance that is allowed to be different for each gene, but assumed to 
be constant over subjects. 

The full data model is 

(2.1) Y = 75 T + /3X T + UV T + 

for variables 



Y 


g 


R Nxn 


response values, 


9 


G 


R nxl 


primary predictor, that is, treatment, with g T g 


7 


g 


R Nxl 


primary parameter, possibly sparse, 


X 


G 


R nxs 


s covariates (e.g., sex) per subject, 


P 


G 


R Nxs 


s coefficients, including per gene intercepts, 


u 


G 


R Nxk 


latent, nonrandom rows (e.g., genes), 


V 


G 


R nxk 


latent, independent rows (e.g., subjects), 


E 






I n ) noise 



and 

S = diag(o"i, . . . , cjn) standard deviations 
with dimensions 

n number of arrays/subjects, 
N 3> n number of genes, 
s <C n number of covariates 

and 

k > 1 latent dimension. 

After adjusting for X, the genes are correlated through the action of the 
latent portion UV T of the model. They may have unequal variances, through 
both X and U. We adopt the normalization E(1/ T V) = I/.. It is possible to 
generalize the model to have a primary variable g of dimension larger than 
one, but we focus on the case of a single primary variable. 

We pay special attention to the case of k = 1 latent variable. The algo- 
rithm is the same for all values of k. But, when k = l, the dependence be- 
tween the variable g of interest and the laten t vari able V can be summarized 
by a single correlation coefficient p = g T V/VV T V which aids interpretation. 
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Writing (2.1) in terms of indices yields 

(2.2) Yij = ll9j + pJXj + UjVj + a iEij , 1 < % < N, 1 < j < n. 

Here ft and Ui are the ith rows of /? and U, respectively, as column vectors. 
Similarly, Xj and Vj are the jth rows of X and V, o~i is the ith diagonal 
element of £ and Eij is the ij element of E. 

Our LEAPP proposal is based on a series of reductions described next. 
In outline, we first split the data into two parts, one of which is completely 
free of the primary variable. We then estimate some properties of the latent 
variable model from that primary-free data. Finally, we use those estimated 
quantities in the part of the data which does contain the primary variable 
to identify genes related to the primary variable. 

2.2. Data rotation. We begin by choosing an orthogonal matrix O 6 
R nxn guch that g jQ\ = o, 0, . . . , 0) G R lxn where r] = \\g\\ > 0. Without 
loss of generality, we assume that the primary predictor has been scaled 
so that rj = 1. A convenient choice for O is the Householder matrix O = 
I n — 2kk t , where k= (g — e\)/\\g — ei 1 1 2 and e\ = (1,0, . . . ,0) T . 

Using O, we construct the rotated model 

(2.3) Y (r) = YO T = 75 T T + /3A t O t + UV t O t + ZEO T 

(2.4) = 7</ r ) T + f3X^ J + UV {r)T + ££ (r) , 

where gW , X^ , and are rotated versions of g, X, V and E, 
respectively. For each major transformation of the data, a new mnemonic 
superscript will be introduced. Some superscripts use the same letter also 
used as a data dimension, but the usages are distinct enough that one will 
not be mistaken for the other. 

Notice that E^ = EO T = E, because E ~ J\f(0,lN®I n ). By construction, 

g( r ) = (1,0,..., 0). Therefore, the model for Y^p is different depending on 
whether j = 1 or j 7^ 1: 

(2-5) Y^=pJxP+UjV^+ ll + a^ 

and 

(2.6) YU = PjX^ + UjV^+ * t e$ , J = 2, . . . , n, 

where is the (i,j)th element of E^ r \ 

The rotated model concentrates the primary coefficients 7, in the first 
column of Y ^ . Our approach is to base tests and estimates of 7^ on equation 

(2.5) . We need to substitute estimates for unknown quantities cij, ft and Ui 
in (2.5). The estimates come from the model in equation (2.6). 

This rotated approach has some practical advantages: First, we do not 
need to iterate between applying equations (2.5) and (2.6). Instead we use 
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(2.6) once to estimate unknowns U, a and f3 and then use (2.5) once to judge 
7i. Second, the last n — 1 columns of Y^ r \ and hence estimates a, P, and U, 
are statistically independent of the first column. Third, problems (2.5) and 
(2.6) closely match settings for which there are usable methods as described 
next. 

Using estimates U and Pi from (2.6) described below, we may write 
(2.5) as 

(2-7) yW-^Jr{ r) = W ) +7i+^g ) . 

The right-hand side of equation (2.7) is a regression with measurement errors 
in the predictors Ui, mean-shift outliers ji and unequal error variances. We 
will use the O-IPOD algorithm of She and Owen (2011), adjusted to handle 
unequal crj, to get our estimate of 

Before describing 0-IPOD we show how to get the estimates Pi, U and 
Ui from the criss-cross regression algorithm of Gabriel and Zamir (1979). 

(r) 

Criss-cross regression will also produce an estimate of Vj for j > 2, but 
those vectors do not play a role in (2.7). 

2.3. Estimating U , f3 and E. We get our estimates of U, Pi and o"j from 
the last n-1 columns of the data set. Let V w and be the 

last n — 1 columns of Y^ r \ X( r > , and E^ r \ respectively. Then the model 
for the last n—1 columns of the data is 

(2.8) Y W = PX ^ T + UV WT + VE® . 

Notice that the quantities (3, U and E in (2.8) are the same as those in the 
original model (2.1) because the steps taken so far operate on columns of Y. 
We can write = Y^D n where D n = "J e ]R nx ( n - 1 ) and similarly for 

X^ and The matrix D n deletes the first column out of n in the matrix 
that it follows. 

We adopt an iterative approach based on (2.8) that alternates between 
updating E and updating the quantities P, U and V^> given E. The update 
for E is 

_ / i \ 1/2 

(2.9) E = f — - diag(e? T ) J where e = Y W - pX® - UV WT . 

That is, <7? is simply the mean squared error of a regression for the ith gene. 

Given E, we standardize the last n — 1 columns, yielding Y^ = E -1 Y^. 
In terms of the other variables, 

(2.10) = p^X^ T + U^V^ T + E^\ 

where p& = E _1 /3, = E" 1 ^ and EW = XT 1 ^ are standardized ver- 
sions of P, U and E^\ respectively. 
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Because T,~ 1 E^ has IID Gaussian entries, equation (2.10) closely matches 
the criss-cross regression model of Gabriel and Zamir (1979). Criss-cross re- 
gression for a matrix of data sums three outer products: row based features 
(with column coefficients), column based features (with row coefficients), 
and a low rank factor model with latent rows and columns. 

We fit a criss-cross regression by first estimating ft 8 ) by least squares 
regression: 

fts) =Y (si) x (e)f X (e)T X (e)yi_ 

Then we estimate and by a truncated singular value decomposition 
(SVD) of rank k applied to the residuals ^ = - ft s *>X^ T . We absorb 
the singular values into but retain the identity V^ T V^ = I k . 

Our use of criss-cross regression has a latent factor model of the form 
UV T and terms of the form (3X T representing column features with row 
coefficients. The full criss-cross regression model also allows for terms of the 
form Z5 T that combine row features with column coefficients. 

To apply the algorithm, we need a starting point for the iteration and 
a value of k. We start with £ = Jjy We have assumed that the rank k for 
the latent variables is known. When it must be estimated from the data, we 
follow Leek and Storey (2008) in using the method of Buja and Eyuboglu 
(1992), as described in Section 2.5. 

Criss-cross regression gives us estimates S, ft 8 ) and U^ s \ We can estimate 
j3 by S 1 / 2 /^) and U by E 1 / 2 ^ 5 ). We will use these estimates normalized 
by crj and so it is also possible to work with ft s ) and U^ s ) themselves. 

2.4. Gene identification. Now we return to the first column of the ro- 
tated data matrix which contains the effects of the primary variable. If we 

(r) 

divide by ai, we get 

(2.11) ^ = ^X[ r) + ^LyW + ^ + e g>, i = l,...,N. 

o~i ai ai ai 

For our purposes, equation (2.11) can be cast as a regression of standard- 
ized variables on k predictors Ui/ai with coefficient vector £ with 

additive outliers 7i/<7j and offsets 0J x[ r ^ /a^. Though <7j and fa and f7, are 
unknown, we have estimates of them from the previous section. 

We use those estimates to construct the primary variable regression model 

(2.12) y.(P) =C7 WT F W +7 W +£ (f) 

with response = (Y^ - pTx[ r) )/ai, predictors uf 1 = 1% = U$ , 

coefficient vector = V± r \ additive outliers 7$ = ji/ai, and error = 
(r) /- 
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The G-IPOD algorithm of She and Owen (2011) is designed to estimate 
a regression coefficient in the presence of additive outliers as well as to 
identify which observations are outliers. In the present context, the outliers 
correspond to genes that are associated with the primary variable. 

For a complete description of © IPOD see She and Owen (2011), who 
also cite related work in the robust regression literature. Here we give a 
brief account of the main points. 

The primary variable model (2.12) could be fit by minimizing \\Y^ — 
+ A ||7( p ) || over V} p) and 7W. Large enough penalties A > would 
yield a sparse estimate of which is desirable because the model has N + k 
parameters and only N observations. 

The natural algorithm to minimize the sum of squared errors with an L\ 
penalty on the additive outlier coefficients alternates between two steps. One 
step estimates the additive outlier effects by soft thresholding residuals from 
a least squares regression. The other step does the least squares regression 
after first subtracting the estimated outlier effects. She and Owen (2011) 
found that while soft thresholding is not robust, simply changing the algo- 
rithm to do hard thresholding proved to be very robust. Their algorithm 
also takes account of the leverage values in least squares regression. The 
algorithm requires a choice for A. They used a modified BIC statistic from 
Chen and Chen (2008). 

Our statistic for testing H{q : ji = is 

(2.13) Ti = -! ^ -, 

r 

where V\ is the O-IPOD estimate of V± and r is an estimate of the error 
variance from (2.12). The estimate r is the median absolute deviation from 

the median (MAD) of — U^ T Vi, with the customary scaling to match 
the standard deviation for a Gaussian distribution. 

For p- values we use Pr(|Z| > |Tj|) where Z ~ A/"(0, 1). Candidate hypothe- 
ses are ranked from most interesting to least interesting by taking the p- 
values from smallest to largest. This is equivalent to sorting \Ti\ from largest 
to smallest. We consider the quality of this ordering, not whether the p- values 
are properly calibrated, apart from a brief remark in the conclusions. 

The entire LEAPP algorithm is summarized in Figure 1. 

We have emphasized the setting in which 7 is a sparse vector. When 7 is 
not a sparse vector, then its large components may not be flagged as outliers 
because the MAD estimate of r would be inflated due to contamination 
by 7. In this case, however, we can fall back on a simpler approach to 

estimating r. The error has variance K(af /af). This variance differs from 
unity only because of estimation errors in <7j. We can then use r 2 = 1. We 
can account for fitting s regression parameters to the n — 1 samples in each 
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(1) Standardize the primary variable, g = g/\\g\\- 

(2) Define the rotation matrix O = I n — 2kk t for k = (g — ei)/\\g — ex || - 

(3) Rotate FW = YO T and = XO T . 

(4) Select the last n - 1 columns = Y (r) £>„ and jW = X^D n . 

(5) Let ^)=FW T XW(XW T XW)- 1 . 

(6) Use Buja and Eyuboglu (1992) to estimate the rank k for Yw — 

(7) Set S = /jv. 

(8) Iterate to convergence: 

(a) y(»0 = s- 1 yW. 

(b) =yM T xW(JfW T JfW)- 1 . 

(c) i?^ gets rank fc truncated SVD of EW = YW - f3^X^ T . 

(d) S = (diag(( J g^ - E^KeM - Ei? e) ) T )/(n - I)) 1 ' 2 . 

(9) Let CA S ) be the fc right singular vectors of E^ . 

(10) Set /3 = S^ S ), C/ = SC/( S ). 

(11) Set yW = (yW - jW)/3„ ^ = ^ s) . 

(12) Fit G-IPOD with response y} p) predictors getting 7^, v} p) and 

(13) Let T % = (y ^ - U^ r V^)/r, i = 1, . . . ,N . 

(14) Rank genes from most significant (largest |Tj|) to least. 

Fig. 1. The LEAPP algorithm, using notation from the text. Step (6) can be 
omitted if the desired value of k is already known. Step (8)(d) is written con- 
cisely but can be computed more efficiently. We use |Ti| to rank genes. Conver- 
gence at (8) is declared when ||E ncw — £ id||i/||£oid||i < 1CP 4 with || ■ ||i here be- 
ing the sum of absolute diagonal elements. There is an R package for LEAPP at 
http : // cran. r-project . org/ web/ packages/ leapp/ . 

row of yW by taking r 2 = E((n - 1 - s)/x 2 l _ 1 _ s ) = {n - s - l)/(n - s — 3). 
A further approximate adjustment for estimating k latent vectors is to take 
t 2 = (n — s — k — l)/(n — s — k — 3). This estimate of r can be used in (2.13) 
for ranking of hypotheses if 7 is not suspected to be sparse. 

2.5. SVA. We compare our method to the surrogate variable analysis 
(SVA) method of Leek and Storey (2008). Their iteratively reweighted sur- 
rogate variable analysis algorithm adjusts for latent variables before doing 
a regression. But it does not isolate them. 

A full and precise description of SVA appears in the supplementary in- 
formation and online software for Leek and Storey (2008). Here we present 



10 



Y. SUN, N. R. ZHANG AND A. B. OWEN 



a brief outline. Their model takes the form 

Y = jg T + UV T + EE, 

where UV T is their "dependence kernel" and E is not necessarily normally 
distributed but has independent rows. 

The SVA algorithm uses iteratively reweighted SVDs to estimate U, V 
and 7. The weights are empirical Bayes estimates of Pr(7j = 0, U{ 7^ | 
Y, g, V) from Storey, Akey and Kruglyak (2005). Their method seeks to re- 
move the primary term 75 T by downweighting rows with 7^ 7^ 0. Our method 
creates columns that are free of the primary variable by rotation. 

The SVA iteration is as follows. First, they fit a linear model without any 
latent variables, getting estimates 7 and the residual R = Y — ^yg T . Second, 
they apply the simulation method of Buja and Eyuboglu (1992) to R to 
estimate the number k of factors, and then take the top k right eigenvec- 
tors of R as the initial estimator V. Third, they form the empirical Bayes 
estimates Wi = Pr(7j = 0, Ui 7^ | Y,g, V) from Storey, Akey and Kruglyak 
(2005). Fourth, based on those weights, they perform a weighted singular 
value decomposition of the original data matrix Y , where row i is weighted 
by Wi. The weighted SVD gives them an updated estimator V. They repeat 
steps (3) and (4), revising the weights Wi and then the matrix V, until V 
converges. They perform significance analysis on 7 through the multivariate 
linear regression model 

Y = jg T + UV T + EE, 

where V is treated as known covariates to adjust for the primary effect g. 

To estimate the number k of factors in the SVD, they use a simulation 
method of Buja and Eyuboglu (1992). That algorithm uses Monte Carlo 
sampling to adjust for the well-known problem that the largest singular value 
in a sample covariance matrix is positively biased. That method has two pa- 
rameters: the number of simulations employed and a significance threshold. 
The default significance threshold was 0.1 and the default uses 20 permuta- 
tions. 

2.6. EIGENSTRAT. EIGENSTRAT [Price et al. (2006)] was developed 
to control for differences in ancestry in genetic association studies, where the 
matrix Y represent the alleles carried by the subjects at the genetic markers 
(e.g., Yij S {0,1,2} counts the number of one of the alleles). The primary 
variable can be case versus control, disease status or other clinical traits. 

In our notation, they begin with a principal components analysis approx- 
imating Y by UV T for U G R Nxk and V £ R nxk . Then for i = 1, . . . , N they 
test whether Y^i :n is significantly related to g in a regression including the 
k columns of V or, equivalently, whether the partial correlation of Yn :n 
on g, adjusted for V, is significant. Although the data are discrete and the 
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method resembles one for Gaussian data, the results still clearly obtain la- 
tent variables showing a natural connection to the geographical region of 
the subjects' ancestors. 

EIGENSTRAT has an apparent weakness. If the signal r yg T is large, then 
its presence will corrupt the estimates of U and V. The estimate V will 
be correlated with the effect g that we are trying to estimate a coefficient 
for. Indeed, we find in our simulations of Section 3 that EIGENSTRAT 
performs poorly when the signal is large compared to the latent variable. 
While EIGENSTRATs strong latent with weak signal assumption seems to 
be appropriate for genetic association studies, a method that does not rely 
on such assumptions is desirable. 

EIGENSTRAT also requires the choice of a rank k for the latent term. 
Price et al. (2006) describe a default choice of k = 10. Patterson, Price and 
Reich (2006) apply a spiked covariance model test of Johnstone (2001) using 
the Tracy- Widom distribution [Tracy and Widom (1994)]. 

2.7. Other methods. We have used Eigenstrat and SVA in our compar- 
isons because they are widely used in applications. A number of other meth- 
ods have been proposed for this problem. It is not feasible to include them 
all in our numerical comparisons. Instead we describe several of them here, 
relating their approaches to the notation of Section 2.1. 

Friguet, Kloareg and Causeur (2009) model their data as Y = jg T + 
UV T + TiE. They assume the latent V is normally distributed (indepen- 
dent of E) and that U is nonrandom. They do not assume sparsity for 7. 
They estimate U, V, 7 and £ by an EM algorithm. They find that using 
V in an FDR procedure is an improvement compared to a model that does 
not employ latent variables. 

Lucas, Kung and Chi (2010) take Y = f3X T + UV T + Y.E and make ex- 
tensive use of sparsity priors. They include the primary variable g as one of 
the columns of X, instead of singling it out as we do. Under their sparsity 
priors, a coefficient is either or it is AA(0,r 2 ). The probability of a nonzero 
coefficient is ir, which in turn has a Beta distribution with a small mean. 
They apply sparsity priors to the elements of both the coefficient matrix j3 
and the latent variables U. The parameters tt and r are different for each 
column of (3. They use Markov chain Monte Carlo for their inferences. 

Allen and Tibshirani (2010) model the data as Y = "/g T + E where E ~ 
•A/"(0, S<8>r). That is, the noise covariance is of Kronecker form which models 
dependence between rows and between columns. Our model has a different 
variance equal to the sum of two Kronecker matrices, one from UV T and 
one from TiE. They estimate their S and V by maximum likelihood with a 
penalty on the normjjf the inverses of S and V. Their Li penalties encourage 
sparsity in S _1 and They then whiten Y using T and £ and apply false 
discovery rate methods. They also show that correlations among different 
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columns lead to incorrect estimates of FDR, while correlated rows do not 
much affect the estimates of FDR. 

Efron (2007) proposed a method to fit an empirical null to the data to 
directly account for correlations across arrays. The empirical null method 
works with estimated Z scores (one per gene) and uses the histogram of 
those scores to account for the effects of latent variables. This process adjusts 
significance levels for hypotheses but does not alter their ordering. 

Carvalho et al. (2008) consider similar problems but apply a very different 
formulation. They treat the primary variable (our g) as the response and 
use the data matrix (our Y) as predictors. 

2.8. Rank estimation. The problem of choosing the number k of latent 
variables is a difficult one that arises for all the methods we used. The Tracy- 
Widom strategy is derived for the case with S = <jIn , while our motivating 
applications have heteroscedasticity. 

Even for £ = al^ it is known that the best rank for estimating UV T is not 
necessarily the true rank. There is a well-known threshold strength below 
which a factor is not detectable and Perry (2009) shows that there is a still 
higher threshold below which estimating that factor worsens the estimate 
of UV T . Owen and Perry (2009) present a cross- validatory estimate for the 
rank k and Perry (2009) shows how to tune it to choose a rank k that gives 
the best reconstruction as measured by the Frobenius norm. 

In our numerical comparisons, LEAPP, SVA and EIGENSTRAT were all 
given the same rank k to use. Sometimes k was fixed at a default value. 
Other times we used the method of Buja and Eyuboglu (1992). 

3. Performance on synthetic data. In this section we generate data from 
the model (2.1) and compare the results from the algorithms to each other, to 
an oracle which is given the latent variable, and to a raw regression method 
which makes no attempt to adjust for latent variables. Some simulations by 
Sun (2011) made under a different model are described in Section 5. 

We choose s = 0, omitting the (3X T covariate term, so the simulated data 
satisfy 

(3.1) Y = 7/ + UV T + EE. 

The model (3.1) is a special case of both the LEAPP model and the SVA 
model. 

Our simulations have n = 60 (subjects) and N = 1000 (genes). Our pri- 
mary covariate is a binary treatment vector g ex (1, . . . , 1, —1, . . . , —1), with 
equal numbers of 1 and —1, normalized so that g T g = 1. 

The vector 7 of treatment effects has independent components ji taking 
the values c > and with probability ir = 0.1 and l — ir = 0.9, respectively. 
We chose c in order to attain specific signal to noise ratios as described 
below. The matrix £ is a diagonal with nonzero entries o~i sampled indepen- 
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dently from an inverse gamma distribution: 1/cr 2 ~ Gamma(5)/4. Note that 
E(a 2 ) = 1. 

We use k = 1 latent variable that has correlation p with g. The latent 
vector U = (u±, . . . , ujv) is generated as independent U(—a, a) random vari- 
ables. We will choose a to obtain specific latent to noise variance ratios. 
The latent vector V is taken to be pg + \/l — p 2 W, where W is uniformly 
distributed on the set of unit vectors orthogonal to g. That is, we sample V 
so as to have a sample correlation and squared norm that both match their 
population counterparts. 

The model (3.1) gives Y three components: the signal S = "yg T , the latent 
structure C = UV T , and the noise M = "EE. The relative sizes of these com- 
ponents affect the difficulty of the problem. We use Frobenius and spectral 
norms to describe the sizes of these matrices. 

The noise matrix is constructed so that K(a 2 £ 2 j) = E(cr 2 ) = 1, so that 
E(||jV||^) = Nn. Because the signal and latent matrices have rank 1, 

(3.2) ms\\i)=ms\\i)=nh\\l)=Nvc 2 

and 

(3.3) E(||£|||)=E(||£||i) = E(||C/|||) = iVa 2 /3. 
For our simulation, we specified the ratios 

SNR = vrc 2 and LNR = a 2 /3 

and varied them over a wide range. We also use SLR = 37rc 2 /a 2 . 

We also varied the level of p, the correlation between the latent and pri- 
mary variables. For each setting of SNR, SLR, LNR and p under consider- 
ation, we simulated the process 100 times and prepared ROC curves, from 
the pooled collection of 100,000 predictions. 

The methods that we applied are as follows: 

true an oracle given UV T which then does regression of Y — UV T on g, 

raw multivariate regression of Y on g ignoring latent variables, 

eig EIGENSTRAT of Price et al. (2006), 

sva surrogate variable analysis from Leek and Storey (2008), and 

lea our proposed LEAPP method. 

The ROC curves for two sets of conditions are shown in Figure 2. The 
best performance is always from the oracle. The next best method is LEAPP. 
For the conditions in the left panel RAW is next best followed by SVA and 
EIGENSTRAT. In the right panel SVA is third, followed by EIGENSTRAT 
and then RAW. 

Because the ROC curves from the simulations have few if any crossings, 
we can reasonably summarize each one by a single number. We have used 
the area under the curve (AUC) for a global comparison. We also use a 
precision measure for the quality of the most highly ranked values. That 
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ROC for rho = 1/2, SLR = 1/2, SNR = 1 ROC for rho = 1/2, SLR = 1/4, SNR = 1 
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Fig. 2. This figure shows the knee of the ROC curves for two simulations with p= 1/2 
and SNR = 1. The left panel has SLR — 1/2. In this case the raw method beats SVA which 
beats EIGENSTRAT. The right panel has SLR = 1/4 and SVA beats EIGENSTRAT which 
beats the raw method. In every case we simulated, the best results are for an oracle that was 
given the latent variables. The second best was always for the proposed LEAPP method. 
The relative performance for SVA, EIGENSTRAT and the raw method were different in 
other settings. 



measure is the fraction of truly nonnull genes among the highest ranking H 
genes. We use H = 50. 

When p = 0, EIGENSTRAT, SVA and LEAPP have almost equivalent 
performance. For p > 0, the oracle always had the highest AUC and LEAPP 
was always second. The ordering among the other three methods varied. 
Sometimes EIGENSTRAT was the best of those three, other times SVA was 
the best of those three and other times RAW was the best of those three. 

Figure 3 shows a heatmap of the improvement in AUC for LEAPP versus 
SVA. The improvements are greatest when p is large. This is reasonable 
because SVA is not designed to account for correlation between the latent 
and primary variables. At each correlation level, the greatest differences arise 
when SNR is small and LNR is about 2. 

Figure 4 shows the improvement in AUC for LEAPP versus EIGEN- 
STRAT. The improvements are largest when the primary effect is large. 

The improvements versus SVA are smaller than those versus EIGEN- 
STRAT. To judge the practical significance of the improvement, we repeated 
some of these simulations for SVA, increasing n until SVA achieved the same 
AUC that LEAPP did. Sometimes SVA required only 2 more observations 
(one treatment and one control) to match the AUC of LEAPP. Sometimes 
it was unable to match the AUC even given double the sample size, that is, 
n = 120 observations instead of n = 60. Not surprisingly, the advantage of 
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AUG vs SVA 



rho=0.2 rho=0.4 
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SNR SNR 

Fig. 3. This figure shows the improvement in AUC for LEAPP relative to SVA. 
Here p is the correlation between the primary and latent variables. The signal to noise 
ratio and latent to noise ratio are described in the text. The color scheme encodes 
(AUCi oa -AUC sva )/AUC sva . 

LEAPP is greatest when the latent variable is most strongly correlated with 
the primary. 

Table 1 shows a feature of this problem that we also see in the figures. 
The improvement over SVA is quite small when LNR = 0.5. A small enough 
latent effect becomes undetectable, both methods suffer and there is little 
difference. Similarly, a very large latent effect (LNR = 8) is easy to detect by 
both methods. The largest differences arise for medium sized latent effects. 

High throughput methods are often used to identify candidates for future 
follow-up investigation. In that case we value high precision for the most 
highly ranked hypotheses. Figure 5 shows the improvement of LEAPP over 
SVA, as measured by precision. Figure 6 shows the improvement of LEAPP 
over EIGENSTRAT, as measured by precision. 




rho= 0.6 



rho= 0.8 




Fig. 4. This figure shows the improvement in AUC for LEAPP relative to 
EIGENSTRAT. The simulation conditions are as described in Figure 3. The color scheme 
encodes (AUC rot - AUC e ig)/AUC e i g . 



4. AGEMAP data. It is hard to find a real data set where the true set 
of important genes is known. Even if we are confident that a few genes 
are active, we still cannot be sure that the others are really inactive: the 
corresponding null hypotheses might be accepted, but they are not proved. 
We turn instead to the AGEMAP study [Zahn et al. (2007)]. 

The AGEMAP study [Zahn et al. (2007)] investigated age-related gene 
expression in mice. Ten mice at each of four age groups were investigated. 
From these 40 mice, samples were taken of 16 different tissues, resulting in 
640 microarray data sets. A small number of those 640 microarrays were 
missing. From each microarray, 8932 probes were sampled. Perry and Owen 
(2010) found that many of the tissues in this data set exhibited strong latent 
variables. Their approach assumed that the latent variables were orthogonal 
to the treatment. 
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Table 1 

This table shows the number of samples required for SVA to attain the same AUC that 
LEAPP attains with n = 60 samples. For example, with SNR — 2 and LNR = 0.5, and 
p = 0.25, SVA requires 66 samples or 10% more. The entries of 100% denote settings 
where the increase needed was >100% 
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Our underlying assumption is that aging should have partially though not 
totally consistent results from tissue to tissue. According to Kim (2008): 
"Some aspects of aging only affect specific tissues; examples include pro- 
gressive weakness of muscle, declining synaptic function in the brain, and 
decreased filtration rate in the kidney. Other aspects of aging occur in all 
cells regardless of their tissue type, such as the accumulation of oxidative 
damage, and telomere shortening." Zahn et al. (2006) found some genetic 
pathways with common age regulation in (human) kidney, brain and muscle. 
Rodwell et al. (2004) found common aging between human kidney, cortex 
and medulla. Some aspects of aging are also common from species to species 
Kim (2007). 

A tendency for some common component to aging should in turn produce 
overlap in gene lists computed from multiple tissues. Because age-related 
genes are sparse, noisy estimation is more likely to reduce overlap in gene 
lists than to create it. 

To illustrate this point, consider a setting with 1000 genes and two tissues 
A and B with counts 

A ^A 

B f 10 10 \ 
->B \ 10 970 J ' 
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Precision vs SVA 

rho= 02. rho= 0.4 
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Fig. 5. This figure shows the improvement in precision for LEAPP relative to SVA. 
Precision is the fraction of truly affected genes among the top H — 50 ranked genes. 
The simulation conditions are as described in Figure 3. The color scheme encodes 
(PRE loa - PRE sva )/PRE sva . 



Here 10 genes are truly age-related in both tissues, 10 are age-related in A 
but not B, and, finally, 970 genes are not age-related in either tissue. Suppose 
now that statistical testing identifies each truly age-related gene with power 
0.6 and that each nonage-related gene has a false discovery probability of 
0.01. Using A and B to represent genes identified as age-related, the expected 
counts (for independent test statistics) are in the following matrix: 

A ^A 
B ( 3.817 17.983 \ 
->B \ 17.983 960.217 / ' 
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Precision vs Eigenstrat 
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Fig. 6. This figure shows the improvement in precision for LEAPP relative to EIGEN- 
STRAT. Precision is the fraction of truly affected genes among the top H = 50 ranked 
genes. The simulation conditions are as described in Figure 3. The color scheme encodes 
(PRE ro t - PRE oig )/PRE eig . 



The effect of noisy gene identification is severely biased toward reducing the 
apparent overlap. 

For any two tissues, we can measure the overlap between their sets of highly 
ranked genes. For two sets A and B, their resemblance [Broder (1997)] is 

ves(A,B) = \- -}, 

where | • | denotes cardinality. Given two tissues and a significance level a, 
we can compute the resemblance of the genes identified as age-related in the 
tissues. Resemblance is then a function of a. Plotting the numerator \ Ad B\ 
versus the denominator \AU B\ as a increases, we obtain curves depicting 
the strength of the overlap. 
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In our setting with 16 tissues there are 



I 1 ®) = 120 resemblances to con- 



sider. To keep the comparison manageable as well as to pool information 
from all tissues, we computed the following quantities: 

16 



We can think of I a /U a as a pooled resemblance. We would like to see large 
I a at each given level of U a . 

Figure 7 plots I a versus U a for the methods we are comparing. To make 
a precise comparison, we arranged for each method that estimated latent 
structure to employ the same estimate for the rank of the latent component. 
That rank is either 1, 2, 3 or the value chosen by the method of Buja and 
Eyuboglu (1992). At any rank LEAPP generates the most self-consistent 
gene lists over almost the entire range. EIGENSTRAT is usually second. 
SVA beats a raw method that makes no adjustments. LEAPP retains its 
strong performance when the rank is chosen from the data while the other 
two methods become poorer in that case. 

Resemblance across tissues could also be high if there exists latent vari- 
ables strongly correlated with age which are repeated across tissues. For 
example, consider a scenario where all tissues from young mice are in one 
batch, and all tissues from elder mice are in a different batch. If there are 
strong batch biases, then "age-related" genes would be reported by the raw 
method, and the same genes would be ranked high across all tissues. How- 
ever, note that raw performs the worst of all methods in Figure 7, which 
gives some reassurance that the high resemblance of the other methods is 
due to successful removal of latent variables. 

Given what we have learned from simulations, the relative performance 
of EIGENSTRAT and SVA gives us some insight into these data. Since 
EIGENSTRAT has done well, it is more likely that the signal is not very 
strong. Since SVA has done poorly, it is more likely that the latent variables 
in these data are correlated with age. There is also the possibility that they 
are correlated with sex (the covariate). Our simulations did not include a 
covariate. 

5. Conclusions. High throughput testing has performance that deterio- 
rates in the presence of latent variables. Latent variables that are correlated 
with the treatment variable of interest can severely alter the ordering of p- 
values. Our LEAPP method separates the latent variable from the treatment 
variable, making an adjustment possible. 

We have found in simulations that the adjustment brings about a better 
ordering among hypotheses than is available from either SVA or EIGEN- 
STRAT. The improvement over SVA is largest when the latent variable is 



where A® 



(4.1) 
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Resemblance across 16 tissues 
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Fig. 7. This figure shows the resemblance among significant gene sets from 16 tissues 
in the AGEMAP study. We plot I a versus U a [from equation (4-1)], increasing a from 
until U a = 700. The greatest self-consistency among lists is from LEAPP. EIGENSTRAT 
is second best. The baseline curve is computed assuming that the rankings for all 16 tissues 
are mutually independent. 



correlated with the primary one. The improvement over EIGENSTRAT is 
largest when the primary variable has a large effect. 

A referee asked about the case where the coefficients of 7 for the primary 
variable correlate over genes with the per gene latent variable, U in our 
notation. We have not simulated such a case. It might be very difficult 
for all methods or it might be comparable to the case where g correlates 
with V . It seems clear that if UV T matches 7<7 T closely enough, then it will 
be impossible to identify relevant genes in this model. 
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In the simulations reported here the data are drawn from the model under 
which LEAPP was derived. Sun (2011) also simulates the LEAPP, SVA 
and EIGENSTRAT algorithms on the model used by Price et al. (2006) to 
represent SNP association studies. The SNPs themselves are drawn from 
the Balding-Nichols model [Balding and Nicols (1995)]. Two scenarios were 
considered. In both, the LEAPP ROC curve placed above that for SVA 
which was above that for EIGENSTRAT. All methods were close when the 
relative risk for the causal allele was R = 1.5 while EIGENSTRAT lagged 
behind for the case with R = 3. 

On the AGEMAP data we found better consistency among tissues for 
significance estimated by LEAPP than for either SVA or EIGENSTRAT. 

Some applications may have features measured on the genes with per- 
sample covariates to be estimated statistically. Such terms can be included 
in the criss-cross regression framework but we have no experience fitting 
them. 

LEAPP produces p- values in addition to the relative ordering of the genes. 
In this paper we have only looked at the quality of the relative ordering. In re- 
sponse to a reviewer's query about calibration of p-values, we created a QQ- 
plot of test statistics Tj at (2.13) on simulated data (not shown) and found it 
very nearly linear. That simulated data was pure noise, having no regression 
or latent structure. For an investigation on real data, Sun [(2011), Chap- 
ter 4.5.2] considered the breast cancer data from Hedenfalk (2001). She finds 
that the test statistics produced by LEAPP have an empirical null distribu- 
tion from the R package locfdr [Efron (2008)] of A/"(0.012, 1.018 2 ) that closely 
matches the nominal null distribution. That is what we would expect to see 
if the nominal p- values coming out of LEAPP had the U[0, 1] distribution 
that they should have. Corresponding empirical nulls are Af(— 0.01, 1.55 2 ) 
for the RAW method, A/"(-0.009, 1.425 2 ) for SVA and A/"(-0.093, 1.199 2 ) 
for EIGENSTRAT. Thus, in addition to a general improved ordering of 
genes, this one example had p- values that are better calibrated in LEAPP 
than in SVA or EIGENSTRAT. 

APPENDIX 

Here we give some properties of our approach to testing many hypotheses 
in the presence of latent variables. We focus on a simpler version of the 
model that is more tractable: 

(A.l) Y = 7/ + UV T + aE, 

where g G M nxl with \\g\\ = 1 as before, U S R 7Vxfc is nonrandom, V E R nxfc 
has IID rows with E(y T l/) = ifc, known rank k and E ~ N(0,In <8> In)- 
Compared to the full model (2.1), equation (A.l) has no covariate term 
(3X T , and has constant variance X = a In. 
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This simplification allows us to apply results from the literature to our 
model. It removes the Monte Carlo based rank estimation step and the 
alternation between estimating £ and using t he est imate S. When k = 1, 
the primary to latent correlation is p = g T V/VV T V. 

Our algorithm requires the choice of a rotation matrix O such that Og = 
e\. There are multiple possibilities for this matrix. Our algorithm is invariant 
to the choice of O. 

Theorem A.l. Let Y follow the model (A.l). Then our estimates ofU 
and 7 do not depend on the rotation O used as long as Og = e\ . 

PROOF. See Sun (2011). □ 

It is not hard to extend the proof of Theorem A.l to account for the /3X T 
term. The criss-cross regression begins by computing (3 from sums of squares 
and cross-products. Those sums of squares and cross-products are invariant 
under the rotation. 

The following theorem provides a sufficient condition for our estimate U 
to consistently estimate U. We study the case where the data are generated 
with k = 1 and the model is also estimated using the correct rank k = 1. 
Then as long as the latent factor U is large enough compared to the noise 
level, we will be able to detect and estimate U fairly well. Our size measure 
IliJIlKl — p 2 )/n takes account of the correlation. With a higher p, more of 
the latent factor is removed from Y^' . 

We measure error by the cosine <£>(U, U) = U T U /(||f Ibll^lh) of the angle 
between U and U. The estimate U is determined only up to sign. Replacing 
U by — U causes a change from V to — V and leaves the model unchanged. 
We only need max(<I>({7, U),&(—U, U)) = \®{U, U)\ ->■ 1 for consistency. 

Theorem A. 2. Let Y follow the model (A.l) with k = l and H^HKl — p 2 )/ 
n — > oo and N(n)/n — > c £ (0, oo) as n — )■ oo. Let U be our estimator for U 
using k = l. Then | U) \ — > 1 as n — > oo with probability 1. 

PROOF. See Sun (2011). □ 

Next we give conditions for the final step of LEAPP to accurately esti- 
mate 7, that is, for ||7 — 7H2 to be small. To do this, we combine methods 
used in random matrix theory from Bai (2003) with methods used in com- 
pressed sensing in Candes and Randall (2006). 

In our simulations we found little difference between robust and nonro- 
bust versions of the O-IPOD algorithm. This is not surprising, since our 
simulations did not place nonzero ji preferentially at high leverage points 
(extreme Un). For our analysis we replace the robust O-IPOD algorithm by 
the Dantzig selector for which strong results are available. 
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Our algorithm was designed assuming that the primary variable g is not 
too strongly correlated with the latent variable V. In our analysis we also 
impose a separation between the effects 7 and the latent quantity U. Specif- 
ically, we assume that 7 is sparse and that U is not. 

The vector x is s-sparse if it has at most s nonzero components. Following 
Candes and Randall (2006), we define the sequences a s (A) and b s (A) as the 
largest and smallest numbers, respectively, such that 

a s (A)\\x\\ 2 <\\Ax\\ 2 <b s (A)\\x\\ 2 

holds for all s-sparse x. 

Theorem A. 3. Suppose that Y follows the model (A.l) with k = 1, 
a fixed correlation p E (—1,1) between g and V , and an s-sparse vector 7. 

Assume that N/n -)• cG (0,oo), V J V A 1, and (iVrz)" 1 1| t^||| ->■ a 2 u > hold 
as n —> 00. Let our estimated U beU and set U* = U/\\U\\2- Writing \ U^ \ > 
I Ufy | > * ■ * > | ^(tv) I f or ordered components of U* , assume that there is 
a constant < B < 1 such that 

i>(*)) 2 +^B^)) 2 ^ 

i=l i=l 
Then the Dantzig estimator 7, which minimizes 

|| 7 ||i subject to ||(I - U*U* T )(Y} r) - 7)11^ < a^N 

satisfies 

2 16c7 2 S log(iV) 

I7-7I2 < 



(l-p2)(l_ jB) 2- 

PROOF. See Sun (2011). □ 
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